Abstract

Gene duplication may be an important mechanism for the evolution of new functions and for the adaptive modulation of 
gene expression via dosage effects. Here, we analyzed the fate of gene duplicates for two strains of a novel group of 
cyanobacteria (genus Acaryochloris) that produces the far-red light absorbing chlorophyll d as its main photosynthetic 
pigment. The genomes of both strains contain an unusually high number of gene duplicates for bacteria. As has been 
observed for eukaryotic genomes, we find that the demography of gene duplicates can be well modeled by a birth-death 
process. Most duplicated Acaryochloris genes are of comparatively recent origin, are strain-specific, and tend to be located 
on different genetic elements. Analyses of selection on duplicates of different divergence classes suggest that a minority of 
paralogs exhibit near neutral evolutionary dynamics immediately following duplication but that most duplicate pairs 
(including those which have been retained for long periods) are under strong purifying selection against amino acid change. 
The likelihood of duplicate retention varied among gene functional classes, and the pronounced differences between strains 
in the pool of retained recent duplicates likely reflect differences in the nutrient status and other characteristics of their 
respective environments. We conclude that most duplicates are quickly purged from Acaryochloris genomes and that those 
which are retained likely make important contributions to organism ecology by conferring fitness benefits via gene dosage 
effects. The mechanism of enhanced duplication may involve homologous recombination between genetic elements 
mediated by paralogous copies of recA. 
Introduction

Gene duplication is an important mechanism of gene innovation and genome evolution
(Ohno 1970; Taylor and Raes 2004). A substantial fraction of eukaryotic, bacterial, and
archaeal genomes may be composed of divergent paralogs resulting from gene family
expansion (Coissac et al. 1997; Jordan et al. 2001; Gevers et al. 2004; Makarova et al.
2005), and examples of the role of gene duplicates as a source of raw material for the
origin of evolutionary novelties and diversification abound (e.g., True and Carroll
2002; Irish and Litt 2005; Wagner 2008).

In addition to ancient paralogs, eukaryotic genomes gen- 
erally contain a large number of recent duplicates (Lynch and 
Conery 2000, 2003). By contrast, although gene duplications 
can occur at frequencies as high as 10^-3 per gene per generation in bacterial genomes
(Anderson and Roth 1977; Haack and Roth 1995; Reams et al. 2010), these duplicates
are quickly purged from the genome unless they confer fitness advantages via dosage
effects (i.e., enhanced gene expression; Roth et al. 1996; Romero and Palacios 1997;
Reams et al. 2010). Consequently, bacterial genomes typically harbor few recent
duplicates (Hooper and Berg 2003b).

Here, we analyzed the age distributions and selection histories of duplicate genes in
the genomes of two strains of the cyanobacterium Acaryochloris which contain an
unusually large number of recent (i.e., low divergence) duplicates for bacterial
genomes: the previously finished genome of Acaryochloris strain MBIC11017 (Swingley
et al. 2008) and a draft genome that we have assembled for Acaryochloris strain
CCMEE 5410. Acaryochloris spp. specialize on far-red wavelengths of solar radiation
that are inaccessible to other photosynthetic organisms through their unique ability
to produce chlorophyll (Chl) d, a structural relative of Chl a, as the major pigment
in photosynthesis (Miyashita et al. 1996; Miller et al. 2005). This recently discovered
group has been detected in diverse marine, freshwater, and terrestrial habitats
(Behrendt et al. 2011) and may make a significant contribution to the global carbon
cycle (Kashiyama et al. 2008). Strain MBIC11017 was isolated from the Great Barrier
Reef (Miyashita et al. 1996), where Acaryochloris biofilms commonly develop underneath
ascidians (Kühl et al. 2005). Strain CCMEE 5410 was isolated from a benthic epilithic
biofilm in the Salton Sea (Miller et al. 2005), a saline, eutrophic closed basin lake
in southern California with major inputs from agricultural runoff and municipal
wastewater.

We report that rates of duplication and duplicate loss fall 
within the range of values estimated for eukaryotic rather 
than bacterial genomes. Although duplicates may experi- 
ence a brief period of relaxed selection, most are rapidly lost 
from the genome, and those which are retained are subject 
to strong purifying selection. The idiosyncratic duplicate 
pools of the respective genomes include many open reading 
frames (ORFs) that appear to be important for fitness in the 
specific environments from which the strains were derived, 
including a large number of duplicates involved in iron acquisition in strain
MBIC11017 and an enrichment of duplicated loci involved in heavy metal resistance
in strain CCMEE
5410. We conclude with consideration of the mechanisms 
which may contribute to the unusual duplication dynamics 
of these bacteria. 

Materials and Methods 

Acaryochloris Strain CCMEE 5410 Genome 

Cells were grown, and genomic DNA was isolated as previously described (Swingley et al.
2008). The CCMEE 5410 genome was sequenced on the 454 FLX Titanium platform and
assembled with Roche's Newbler de novo assembler with default overlap settings. The
JCVI auto-annotation pipeline was used to identify sequence features and assign
functional annotation. Protein-coding sequences were predicted with Glimmer3 (Delcher
et al. 1999), tRNAs were identified with the tRNAscan tool (Lowe and Eddy 1997), and
rRNA genes and other structural RNAs were identified directly from Blast (Altschul
et al. 1990) matches to Rfam. Functional annotation of proteins was assigned based on
comparison of coding sequences against the CHAR database of experimentally verified
proteins and functional annotations, the TIGRFAM (Haft et al. 2003) and Pfam (Finn
et al. 2008) protein family databases, and the PANDA repository of nonredundant
protein and nucleotide data, and by computationally derived assertions including
lipoprotein and transmembrane helix signatures. Assembled contigs greater than 5 kbp
in length were assigned to chromosome or plasmid elements by a nucmer alignment,
implemented in the MUMmer package (Kurtz et al. 2004), against the Acaryochloris
marina strain MBIC11017 reference genome.

This whole-genome shotgun project has been deposited at DNA Data Bank of
Japan/EMBL/GenBank under the accession AFEJ00000000. The version described in this
paper is the first version, AFEJ01000000.

Identification and Analysis of Recent Duplicates 

Paralogs within the genomes of Acaryochloris strains CCMEE 5410 and MBIC11017 were
identified by local BlastP searches (Altschul et al. 1990) of each inferred protein
sequence against its genome. Because the study was focused on the pools of recent
duplicates, putative paralogs sharing less than 50% amino acid identity were removed
from the data set. A similar search strategy was used to identify shared duplicates
via reciprocal local BlastP searches. ORFs annotated as transposases or integrases, or
identified as having significant homology (E < 0.05) to insertion sequence (IS)
elements by BlastP against the IS Finder database (www-is.biotoul.fr/is.html), were
also removed, as were gene families with more than ten paralogs (typically
transposases).
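The filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hit tuples, the function name `filter_paralog_pairs`, and the use of single-linkage clustering to define a gene family are hypothetical choices, not details taken from the study.

```python
from collections import defaultdict


def filter_paralog_pairs(hits, min_identity=50.0, max_family_size=10,
                         excluded=frozenset()):
    """Return candidate paralog pairs from self-BlastP hits.

    hits: iterable of (query, subject, percent_identity) tuples (assumed layout).
    excluded: ORF ids flagged as transposases, integrases, or IS-like.
    """
    pairs = set()
    for query, subject, identity in hits:
        if query == subject:
            continue  # ignore self-hits
        if identity < min_identity:
            continue  # keep only pairs with at least 50% amino acid identity
        if query in excluded or subject in excluded:
            continue
        pairs.add(tuple(sorted((query, subject))))

    # Group genes into families by single linkage (an assumption) with union-find.
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    families = defaultdict(set)
    for a, b in pairs:
        families[find(a)].update((a, b))

    # Drop every pair that belongs to a family with more than ten members.
    oversized = {root for root, members in families.items()
                 if len(members) > max_family_size}
    return sorted(pair for pair in pairs if find(pair[0]) not in oversized)
```

How families are delimited affects which pairs survive the size cutoff, so a different clustering rule could give a slightly different data set.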

Nucleotide alignments of duplicates were obtained by the manual adjustment of ClustalW
automated alignments (Thompson et al. 1994) using the amino acid alignments as a
guide. Silent site divergence (d_S) and replacement site divergence (d_N) between
aligned nucleotide sequences of duplicate pairs were estimated by the maximum
likelihood (ML) procedure implemented in the codeml program of the PAML software
package (version 3.14; Yang 1997). For all models, codon usage (the average nucleotide
frequencies at the three codon positions) and transition/transversion bias were
estimated from the data. Only duplicate pairs with d_S < 5 were considered for further
analysis.

Most cases involved a duplicate pair resulting from a single duplication event. For
cases involving more than two paralogs, we used phylogenetics to distinguish the
duplication events (e.g., resolution of three duplicates by reconstruction of the two
duplication events). Phylogenies of aligned nucleotide sequences were inferred by ML
with PAUP* (Swofford 1996) according to a model of DNA sequence evolution selected by
hierarchical likelihood ratio tests implemented by Modeltest (Posada and Crandall
1998). For the ML heuristic search, a starting tree was obtained by random sequence
addition, and branch swapping was performed by tree bisection and reconnection. The
resulting topology was used to specify the tree for the PAML model as described above.

recA Phylogeny Reconstruction and Tests of Protein Adaptation

Nucleotides (1,023) of the recA genes of Acaryochloris and other representative
cyanobacteria were aligned by ClustalW (Thompson et al. 1994). A ML tree was
reconstructed with PAUP* as described above according to the general time reversible
(GTR) + I + G model of sequence evolution selected by Modeltest (Posada and Crandall
1998) and bootstrapped 100 times. A Bayesian analysis was performed with MrBayes
(Huelsenbeck and Ronquist 2001) using the GTR + I + G model. Two independent chains of
1,000,000 generations of Markov chain Monte Carlo were analyzed, with trees sampled
every 1,000 generations. Chain convergence was evaluated by the average standard
deviation of split frequencies, and the first 20% of trees were discarded as burn-in.
To test for the signature of positive selection during Acaryochloris recA
diversification, branch-site models of codon evolution (Yang and Nielsen 2002) were
implemented with PAML. Likelihood scores of nested models which either allow for a
class of positively selected codon sites (i.e., d_N/d_S > 1) or constrain d_N/d_S to be
less than or equal to 1 (the nearly-neutral model) were compared with a χ² test. For
branches of the recA tree for which the nearly-neutral model was rejected, a Bayes
empirical Bayes analysis was used to infer which codon sites belonged to the
positively selected class with high (>95%) posterior probability.
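As a rough sketch of the nested-model comparison described above, the likelihood ratio statistic is twice the difference in log-likelihood between the model allowing positive selection and the nearly-neutral model, and it is referred to a χ² distribution. The degrees of freedom are left as a parameter here because the text does not state them; the function names and the restriction to 1 or 2 degrees of freedom (for which closed forms exist) are assumptions made to keep the example free of external libraries.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate for df = 1 or df = 2."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only supports df = 1 or df = 2")


def likelihood_ratio_test(lnL_null, lnL_alt, df):
    """Return (2 * delta lnL, P value) for a pair of nested models."""
    stat = 2.0 * (lnL_alt - lnL_null)
    return stat, chi2_sf(stat, df)


# Hypothetical log-likelihoods, for illustration only.
stat, p_value = likelihood_ratio_test(lnL_null=-1000.0, lnL_alt=-997.0, df=2)
```

The null model is rejected when the P value falls below the chosen significance level; only then would the site-level posterior probabilities be examined.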

Results and Discussion 

Acaryochloris Strain CCMEE 5410 Genome 

The Acaryochloris strain CCMEE 5410 genome was pyrosequenced to approximately 24×
coverage depth, and the resulting genome data were assembled into 511 contigs greater
than 500 bp, with an N50 of 37,625 bp. The estimated genome size of 7.88 Mbp is
somewhat smaller than that of the previously finished genome of Acaryochloris strain
MBIC11017 (table 1; Swingley et al. 2008) as well as of a recently described strain
isolated from the Great Barrier Reef for which an unpublished draft genome sequence
has been obtained (~8.37 Mbp; Mohr et al. 2010) but is still considerably larger than
those of other unicellular cyanobacteria. The strain CCMEE 5410 genome likewise
contains fewer predicted ORFs than that of strain MBIC11017. The two genomes share
similar base composition and a high number of ORFs with significant homology to IS
elements (table 1).

Table 1

General Features of the CCMEE 5410 and MBIC11017 Genomes

                        CCMEE 5410    MBIC11017^a
Genome size (Mbp)       7.88          8.36
GC content (%)          47            47
ORFs                    8383          8528
Strain-specific ORFs    2261          2406
IS elements             552           487

^a Data from Swingley et al. (2008).
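For readers unfamiliar with the N50 statistic quoted for the CCMEE 5410 assembly, the following minimal sketch shows one common way to compute it: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the assembly size. The function name and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the assembly."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs supplied")


# Toy example with made-up contig lengths (bp).
example = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```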

The CCMEE 5410 and MBIC11017 genomes share 6,122 putative orthologs, with greater than
25% of predicted ORFs in each genome absent from the other (table 1). For the closed
MBIC11017 genome, we can identify with certainty the genetic element on which each of
these idiosyncratic ORFs resides. In addition to a circular chromosome, it contains
nine apparently single-copy plasmids, varying in size from approximately 2.1 to 374
kbp, which together comprise roughly 22% of the genome (Swingley et al. 2008). For the
CCMEE 5410 assembly, we provisionally assigned contigs greater than 5 kbp in length to
either the chromosome or a plasmid element using a nucmer alignment against the
MBIC11017 genome (supplementary table S1, Supplementary Material online). This length
cutoff was chosen because most short contigs exhibited no homology to the MBIC11017
genome and/or encoded one or more IS elements. One hundred and eighty-eight contigs
totaling 5.81 Mbp were assigned to the chromosome, and 61 contigs with a cumulative
size of 1.52 Mbp were assigned to plasmids (supplementary table S1, Supplementary
Material online).

Gene content is generally conserved on the two Acaryochloris chromosomes.
Approximately 89% of ORFs on the MBIC11017 chromosome (5,621/6,342) have homologs in
the CCMEE 5410 genome, whereas 83.5% of ORFs (4,951/5,932) on contigs assigned to the
CCMEE 5410 chromosome have homologs in the MBIC11017 genome. Mapping of these
chromosome contigs to the MBIC11017 reference indicated a high degree of sequence
conservation and local synteny between chromosomes (fig. 1A; reference range data in
supplementary table S1, Supplementary Material online).

By contrast, differences in gene content between the genomes are concentrated on
plasmids. Seventy-seven percent (1,685/2,186) of MBIC11017 plasmid ORFs have no
homolog in the CCMEE 5410 genome, accounting for 70% of the ORFs absent from the
latter. The individual plasmids vary in the fraction of ORFs with homologs in the
CCMEE 5410 genome from 0% (pREB9) to ~48% (pREB4). Similarly, for CCMEE 5410 contigs
assigned to a plasmid, 55% of the 1,649 ORFs lacked a homolog in the MBIC11017 genome.
In addition, few large blocks of synteny were observed among the MBIC11017 plasmid
ORFs that were shared between the genomes (supplementary table S1, Supplementary
Material online). The most extensive syntenic regions were clustered on plasmid pREB4
in a region spanning MBIC11017 ORFs D0134-D0214 (fig. 1B). Blocks of synteny from this
plasmid include genes responsible for the biosynthesis and maturation of a
bidirectional hydrogenase (ORFs D0176-D0197; nucleotides 140334-159433) and a complete
set of loci encoding an alternative ATP synthase (ORFs D0157-D0167; nucleotides
123957-132033). These results suggest a greater instability of the Acaryochloris
plasmids compared with the chromosome.

Fig. 1. (A) Sequence and gene order conservation between Acaryochloris chromosomes
illustrated for CCMEE 5410 contig 453, which maps to positions 3.98-4.10 Mbp on the
strain MBIC11017 chromosome. The y axis is the probability that a pair of aligned
nucleotides is identical in state between genomes along a sliding window of 100
nucleotide sites with a 25 site step-size. Nonhomologous regions include transposases
at the contig breakpoints, CCMEE 5410 ORFs missing in the MBIC11017 genome (ORFs 191
and 205) and an ORF (ORF 187) which maps to coordinates 6.368-6.369 Mbp on the
MBIC11017 chromosome. (B) Sequence conservation between Acaryochloris genomes for
CCMEE 5410 contigs homologous to MBIC11017 plasmid pREB4: contig 468 (blue), contig
500 (green), contig 510 (orange), contig 511 (gold), contig 576 (brown), contig 598
(plum). Approximately half of pREB4 is missing from the CCMEE 5410 genome. The
fraction of each contig that maps to pREB4 ranges from 34.5% (contig 511) to 96%
(contig 468).
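The sliding-window identity profile described in the figure 1 legend (windows of 100 aligned sites, moved in steps of 25 sites) can be sketched as follows. The function name is hypothetical, and the treatment of alignment gaps (counted as non-identical) is an assumption, since the legend does not say how gaps were handled.

```python
def sliding_identity(seq_a, seq_b, window=100, step=25):
    """Fraction of identical aligned sites per window for two aligned sequences.

    Returns a list of (window_start, identity) tuples. Gap characters ('-')
    are counted as mismatches, which is an assumption of this sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for start in range(0, len(seq_a) - window + 1, step):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        profile.append((start, matches / window))
    return profile


# Toy alignment: identical for 100 sites, then fully divergent for 50 sites.
profile = sliding_identity("A" * 150, "A" * 100 + "C" * 50)
```

In the toy alignment the identity drops stepwise as the window moves into the divergent segment, which is the kind of signal that marks a nonhomologous region in such a plot.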

Age Distribution of Duplicated Genes in Acaryochloris Genomes

Both genomes are notable for their large number of recent paralogs. We identified 393
and 597 duplicate pairs with synonymous-site divergence (d_S) less than 5 in the
genomes of Acaryochloris strains CCMEE 5410 and MBIC11017, respectively. A majority of
duplicated regions involve only a single protein-coding ORF; only ~29% of pairs
(N = 174) in the strain MBIC11017 genome and ~35% of pairs (N = 136) in the strain
CCMEE 5410 genome were a part of duplicated blocks of greater than one ORF.

Most duplicates belong to the least divergent classes (d_S < 1; fig. 2A). The
difference between strains in the observed number of duplicate pairs is primarily due
to a greater number of duplicates in these classes in the genome of strain MBIC11017,
which contains approximately double the number of duplicate pairs with d_S < 0.5 (278
vs. 143). By contrast, the number of duplicate pairs with d_S > 2 is similar between
the genomes (121 vs. 102). For both Acaryochloris genomes, the number of duplicate
pairs with d_S < ~1.5 is very large compared with other representative bacterial
genomes (fig. 2B; Hooper and Berg 2003b). For greater levels of d_S, duplicate numbers
are comparable, with the exception of an apparent enhanced density of duplicates in
Acaryochloris genomes centered on d_S values of ~2-2.4 (fig. 2).

Fig. 2. (A) Frequency distributions of duplicate pairs for Acaryochloris strains
MBIC11017 (blue) and CCMEE 5410 (green). (B) Frequency distributions of Acaryochloris
duplicate pairs compared with data for Escherichia coli K12, Bacillus subtilis 168,
and Pseudomonas aeruginosa PAO1 (from Hooper and Berg [2003b]).

Most duplicate pairs from the least divergent classes are strain specific, whereas
more divergent duplicates are generally more likely to be present in both genomes
(fig. 3). This pattern is in accord with the expectation that silent site divergence
is generally a reasonable proxy for the age of a duplication event and that less
divergent duplicate pairs have therefore largely originated following the divergence
of these strain lineages from their common ancestor. However, there are a number
(N = 60) of low divergence (d_S < 1) duplicate pairs in the strain CCMEE 5410 genome
that also are found in the strain MBIC11017 genome. Such pairs may be the result of
convergent duplication events following strain divergence or, alternatively, may
appear "younger" than they are due to either gene conversion or extreme sequence
conservation at synonymous sites. We believe that slower than average evolutionary
rates are of primary importance for these loci because clear evidence from
phylogenetic analyses for either convergent evolution or gene conversion (i.e.,
paralogs clustering by strain) was observed for only a minority (N = 14) of duplicate
pairs (data not shown). Among more divergent duplicates, approximately 50% (77/166) of
duplicate pairs of divergence level d_S > 1 in the strain CCMEE 5410 genome are
present in the MBIC11017 genome (fig. 2). The unique duplicates among the more
divergent age classes suggest that there has been the differential retention of older
duplicates between genomes following strain divergence.

Fig. 3. Fraction of duplicates of different divergence levels (d_S) in the
Acaryochloris CCMEE 5410 genome that are shared with the strain MBIC11017 genome.

Estimation of Duplicate Birth and Death Rates 

Following the approach of Lynch and Conery (2000, 2003), we modeled each age
distribution as a steady-state birth-death process in order to estimate the rates at
which duplicates arise and disappear from the respective Acaryochloris genomes.
Because the assumption of constant birth and death rates is more likely to be valid
over a short time scale, we limited the analyses to duplicate pairs with silent site
divergence (d_S) less than 0.1. For both data sets, we also excluded duplicate pairs
in these age classes (N = 7 pairs) found in both genomes (see above) to remove the
potential impacts on the analysis of either gene conversion events or slowly evolving
duplicates. We note that similar results were obtained for the full data set (not
shown).

Under a steady-state birth-death process, the instantaneous rate of removal of
duplicates from the genome (d) can be estimated from the slope (which equals -d) of
the linear regression of ln n_i on d_S, where n_i is the number of duplicate pairs in
age class i. The regression models explained most of the variation in both data sets
(R² = 0.82, P < 0.0001 for Acaryochloris strain CCMEE 5410; R² = 0.76, P < 0.0001 for
Acaryochloris strain MBIC11017), suggesting that the assumption of constant birth and
death rates over this time interval is reasonable. Estimates of d (standard error
[SE]) were not significantly different for the two strains: 8.0 (2.14) for CCMEE 5410
and 7.8 (2.52) for MBIC11017. This corresponds to estimated half-lives (scaled to
synonymous site divergence) of 0.087 and 0.089 for Acaryochloris strains CCMEE 5410
and MBIC11017, respectively. That is, most duplicates are expected to be lost rapidly
from the genome. These values are within the range observed among eukaryotic genomes
(Lynch and Conery 2003).
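The regression and half-life calculation above can be illustrated with a short sketch. It fits ln n_i against d_S by ordinary least squares, takes d as the negative slope, and converts d to a half-life as ln 2 / d. The third function evaluates the birth-rate expression introduced in the next paragraph. Function names, the dropping of empty age classes (whose logarithm is undefined), and any input values are assumptions for illustration rather than details reported in the study.

```python
import math


def fit_death_rate(ds_midpoints, counts):
    """Least-squares fit of ln(n_i) on d_S; returns (d, intercept, r_squared)."""
    points = [(x, math.log(n)) for x, n in zip(ds_midpoints, counts) if n > 0]
    if len(points) < 2:
        raise ValueError("need at least two non-empty age classes")
    k = len(points)
    mean_x = sum(x for x, _ in points) / k
    mean_y = sum(y for _, y in points) / k
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return -slope, intercept, r_squared  # death rate d is minus the slope


def half_life(d):
    """Divergence (in d_S units) at which half of a duplicate cohort is lost."""
    return math.log(2.0) / d


def birth_rate(n_b, d, ds_max, n_genes):
    """B = n_B * d * d_S / (N * (1 - exp(-d * d_S))) for the window [0, ds_max)."""
    return (n_b * d * ds_max) / (n_genes * -math.expm1(-d * ds_max))
```

With the reported estimates d = 8.0 and d = 7.8, `half_life` returns approximately 0.087 and 0.089, matching the half-lives quoted above.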

We estimated the duplicate birth rate B (the probability that a gene duplicates over
the divergence period d_S = 0.1) for each genome by

$B = \dfrac{n_B \, d \, d_S}{N \left(1 - e^{-d \, d_S}\right)}$,

where n_B is the number of duplicate pairs observed at a divergence level below
d_S = 0.1, and N is the total number of genes in the analysis excluding excess
duplicates. The duplicate birth rate of strain MBIC11017 was estimated to be between
two and three times greater than that of strain CCMEE 5410 (0.023 vs. 0.010). We
conclude from the above analyses that the observed differences in the frequency
distributions of recent duplicate pairs in the genomes of the two strains can be
solely explained by differences in their duplication rates.

Selection on Duplicate Pairs

The idea that redundant gene copies experience a period of relaxed selection (i.e.,
d_N/d_S ≈ 1) following duplication is central to early models of the evolution of
novel function (e.g., Ohno 1970). For both Acaryochloris genomes, a minority of
duplicate pairs does appear to be under relaxed selective constraints immediately
following duplication (fig. 4); for duplicates with a divergence level of d_S < 0.1,
mean (SE) values of d_N/d_S are 0.45 (0.067) and 0.48 (0.084) for the strain MBIC11017
and strain CCMEE 5410 genomes, with approximately 25% of duplicate pairs having
d_N/d_S > 0.5 in both genomes. A small number of these duplicates (four in the strain
CCMEE 5410 genome, nine in the strain MBIC11017 genome) have d_N/d_S > 1, which
suggests that they may be under positive selection. With one exception, the
duplication of a chorismate mutase gene in strain MBIC11017, all of these duplicates
are annotated as hypothetical or conserved domain proteins. However, most young
duplicates as well as those which have been retained over longer periods appear to be
under strong purifying selection against protein change: the median of d_N/d_S in the
d_S < 0.1 divergence level classes is ~0.2 and ~0.3 for the strain MBIC11017 and
strain CCMEE 5410 genomes, respectively. For duplicates of divergence level greater
than d_S = 1, the mean (SE) value of d_N/d_S is 0.12 (0.004) for the strain MBIC11017
genome and 0.09 (0.004) for the strain CCMEE 5410 genome. Bearing in mind that the
estimated strength of constraint represents the cumulative history of selection since
duplication, this pattern indicates that, on average, the intensity of purifying
selection on duplicates increases over time. We conclude that the period of
near-neutral evolutionary dynamics is at most brief following gene duplication,
applies to only a subset of duplicate pairs, and usually is followed by either purging
from the genome or an increase in selection against protein change. These results are
similar to those obtained for other bacteria (Hooper and Berg 2003b) as well as for
eukaryotic genomes (Lynch and Conery 2000, 2003; Aury et al. 2006).

Fig. 4. Selection (d_N/d_S) on duplicates in the strain CCMEE 5410 (A) and MBIC11017
(B) genomes. Note the different scales on the y axis for (A) and (B).



Physical Location of Duplicated Genes 

The location of duplicates at (or near) the time of birth may provide clues regarding
the substrates and prevailing mechanisms responsible for duplicate formation. Few
duplicates (~3% of duplicate pairs) are in tandem (operationally defined here as being
within five ORFs of each other) at present in either Acaryochloris genome.

The closed genome of Acaryochloris strain MBIC11017 enabled a comprehensive
investigation of the distribution of duplicates on the chromosome and on
extrachromosomal elements, respectively. For the least divergent classes, at least one
gene copy resides on a plasmid for most duplicate pairs (fig. 5A and B), with both on
plasmids for greater than 60% of duplicates with a synonymous divergence level of
d_S < ~0.5. Because duplicates might move over time, the locations of the least
divergent duplicates are expected to be most representative of where they originated.
Of the 133 duplicate pairs of divergence level d_S < 0.1, both members are found on
the same genetic element (chromosome or plasmid) only ~16.5% of the time. Similarly,
13 of 21 identical (i.e., d_S = d_N = 0) duplicate pairs are located on different
elements, and of the eight which are on the same element, six likely originated as
part of the same duplication event on plasmid pREB3. The origin of most duplicates
therefore appears to involve recombination between different plasmids (67/133) or
between a plasmid and the chromosome (44/133).
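The classification of duplicate pairs by genetic element can be sketched as below. The sketch assumes the ORF naming convention stated in the footnote to table 2 (plasmid ORFs carry a leading letter identifying the plasmid; chromosomal ORFs are purely numeric); the function names and category labels are hypothetical.

```python
def element_of(orf_id):
    """Return the plasmid letter, or 'chromosome' for an all-numeric ORF id."""
    return orf_id[0].upper() if orf_id[0].isalpha() else "chromosome"


def classify_pair(orf_a, orf_b):
    """Classify a duplicate pair by where its two copies reside."""
    elem_a, elem_b = element_of(orf_a), element_of(orf_b)
    on_chromosome = (elem_a == "chromosome", elem_b == "chromosome")
    if all(on_chromosome):
        return "chromosome-chromosome"
    if any(on_chromosome):
        return "plasmid-chromosome"
    return "same plasmid" if elem_a == elem_b else "different plasmids"


# Pairs written as in table 2, e.g. "0473/A0147".
labels = [classify_pair(*pair.split("/")) for pair in ("0473/A0147", "C0108/C0205")]
```

As a consistency check on the counts in the text, 133 - 67 - 44 leaves 22 pairs on the same element, and 22/133 is about 16.5%.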

Chromosome-chromosome pairs make a substantial contribution to the pool of duplicates
from more divergent classes, however, with 40% of duplicates both residing on the
chromosome at divergence levels greater than d_S = 2 (fig. 5C). Plasmid-plasmid
duplicates are nearly absent in these classes (fig. 5A). Although most gene
duplication events involve interplasmid or plasmid-chromosome exchange, it therefore
appears that the vast majority are destined for loss from the genome. Duplicates that
are retained over the long term tend to either originate on the chromosome or end up
there.

Fig. 5. Frequency distributions of duplicate pairs in the Acaryochloris strain
MBIC11017 genome for duplicate pairs for which both copies currently reside on one or
more plasmids (A), one copy is on a plasmid and the other is on the chromosome (B), or
both copies are on the chromosome (C).

We reach a similar conclusion for the CCMEE 5410 genome (supplementary fig. S1,
Supplementary Material online), although we could not assign one or more copies of a
duplicate pair to a genetic element for 25% of duplicates. Most of these unassigned
pairs belonged to low divergence classes (d_S < 1.0), with one-third from a divergence
class of d_S < 0.1. Inability to resolve the locations of these duplicates was due to
the presence of one or both copies on a short contig and is likely responsible for the
observed lower than expected density of interplasmid and plasmid-chromosome pairs in
these low divergence classes (supplementary fig. S1A and B, Supplementary Material
online; compare with fig. 2A). The placement on short contigs suggests that they are
flanked by repetitive DNA (including IS elements) that may have served as substrates
for recombination.

Fig. 6. Distributions of clusters of orthologous groups (COG) functional classes for
the strain CCMEE 5410 genome (blue) and for duplicate pairs of divergence level
d_S < 5 (green).

Duplicate Retention 

Bacterial genomes may exhibit a biased retention of duplicates from different gene
functional classes (Gevers et al. 2004). Analysis of the strain CCMEE 5410 genome
indicated differences in the likelihood of retention among clusters of orthologous
groups (COGs) functional classes (fig. 6). In particular, the pool of duplicated genes
(d_S < 5) is enriched in members from the transcription (K), carbohydrate transport
and metabolism (G), ion transport and metabolism (P), signal transduction (T), and
unknown (S) functional classes compared with their genome-wide frequencies.
Conversely, there is a general paucity of duplicated genes involved in translation
(J), replication, recombination and repair (L), cell wall/membrane/envelope biogenesis
(M), amino acid transport and metabolism (E), and coenzyme transport and metabolism
(H). This suggests that gene dosage balance may generally be more critical within
these classes, with duplication of individual genes strongly selected against.
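The comparison between the duplicate pool and genome-wide class frequencies can be sketched as a simple ratio of proportions per COG class. This is an illustration only: the function name and the toy input are hypothetical, and the sketch does not include a significance test.

```python
from collections import Counter


def cog_enrichment(genome_classes, duplicate_classes):
    """Ratio of each COG class frequency among duplicates to its genome frequency.

    Values above 1 indicate over-representation among duplicates; values below 1
    indicate a paucity of duplicates in that class.
    """
    genome = Counter(genome_classes)
    dups = Counter(duplicate_classes)
    genome_total = sum(genome.values())
    dup_total = sum(dups.values())
    return {
        cog: (dups[cog] / dup_total) / (count / genome_total)
        for cog, count in genome.items()
    }


# Toy data: class K is over-represented among duplicates, class J is depleted.
ratios = cog_enrichment(["K"] * 10 + ["J"] * 10, ["K"] * 3 + ["J"] * 1)
```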

The observed biased retention of recent gene duplica- 
tions in the G, K, and P classes, as well as a deficiency of 
H, J, and M classes, is in accord with general longer term 
evolutionary trends revealed for paralogous gene family ex- 
pansion in a survey of 48 bacterial genomes (Gevers et al. 
2004). The retention of signal transduction (T) and transcrip- 
tion factors (K) is also a feature of plant genomes following 
polyploidization (Blanc and Wolfe 2004; Maere et al. 2005; 
Chapman et al. 2006; Thomas et al. 2006). 
Table 2

Select Strain-Specific Duplicates in the Strain MBIC11017 Genome

ORFs^a                Annotation                                d_S          CCMEE 5410
Nutrient acquisition
0473/A0147            Fe2+-transporter feoB                     0.26         7582
0474/A0146            Fe2+-transporter feoA                     0.11         7581
3038/(B0139/F0079)    Fur transcriptional regulator             0.55/0.30    2939
3040/F0079            Putative Fe2+-transporter                 0.21         2937
3348/A0161            Fe3+-dicitrate ABC transporter            1.36         0699
3349/A0162            Fe3+-dicitrate ABC transporter            1.23         0700
3350/A0163            Fe3+-dicitrate ABC transporter            1.23         0701
3401/A0182            Fe3+-dicitrate ABC transporter            0.22         0727
3402/A0183            Fe3+-dicitrate ABC transporter            0.28         0728
3403/A0184            TonB-dependent siderophore transporter    0.15         0729
3416/A0185            Ferrichrome ABC transporter               0.31         0738
C0108/C0205           Heme oxygenase (Fe-recycling)             0
3533/3534             Ammonium transporter                      0.003        8087
Light harvesting
1368/3655             Iron deficiency light antenna pcbC        0            8040
C0093/C0216           Phycobilisome linker protein              0.02
C0094/C0215           Phycobilisome 32.1 kDa linker             0
C0096/C0213           Phycocyanin, alpha subunit                0
C0098/C0212           Phycocyanin, beta subunit                 0
C0099/C0191           Phycocyanin, alpha subunit                0
C0100/C0192           Phycocyanin, beta subunit                 0.01

^a ORFs on plasmids are preceded by a letter indicating plasmid identity.



Unique duplicates retained by the respective genomes may confer
environment-specific fitness benefits through dosage effects, a
phenomenon frequently observed in laboratory populations of
bacteria (Roth et al. 1996; Romero and Palacios 1997; Reams and
Neidle 2003). A correlation between duplicate content and
environment has also been observed for yeast (Ames et al. 2010).
The genome of strain MBIC11017 possesses a striking number of
duplicated genes involved in nutrient acquisition (principally the
binding, transport, and metabolism of iron) that either exist as
single copies or are not found in the strain CCMEE 5410 genome
(table 2). All but one of these include a plasmid-encoded duplicate
copy. We note that the strain MBIC11017 genome also includes eight
plasmid-encoded single-copy iron acquisition genes that are absent
from the strain CCMEE 5410 genome (ORFs A0156, A0157, A0172,
A0197, A0198, A0274, B0123, B0125).
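
The statement that all but one of the nutrient-acquisition duplicates include a plasmid-encoded copy can be checked against the ORF labels of table 2, because plasmid ORFs carry a letter prefix (table 2, footnote a). The following is a minimal sketch for illustration only: the helper name `has_plasmid_copy` and the hard-coded list of labels are assumptions made for this example and are not part of the original analysis.

```python
import re

# ORF labels of the nutrient-acquisition duplicates, transcribed from table 2.
NUTRIENT_DUPLICATES = [
    "0473/A0147", "0474/A0146", "3038/(B0139/F0079)", "3040/F0079",
    "3348/A0161", "3349/A0162", "3350/A0163", "3401/A0182",
    "3402/A0183", "3403/A0184", "3416/A0185", "C0108/C0205", "3533/3534",
]


def has_plasmid_copy(label: str) -> bool:
    """Return True if any ORF in the label has a plasmid letter prefix.

    Hypothetical helper: it relies only on the naming convention stated in
    the table footnote (a leading letter identifies the plasmid).
    """
    orfs = re.findall(r"[A-Za-z]?\d+", label)
    return any(orf[0].isalpha() for orf in orfs)


# Duplicate families whose copies are all chromosomal.
chromosome_only = [lab for lab in NUTRIENT_DUPLICATES if not has_plasmid_copy(lab)]
print(chromosome_only)
```

Under this naming convention, the only family without a plasmid copy is the ammonium transporter pair 3533/3534.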

That this strain's genome may have been shaped by iron limitation
is also suggested by the recent duplication of the light antenna
protein gene pcbC (table 2). This gene is upregulated by
Acaryochloris cells under conditions of iron deficiency (Chen et
al. 2005), and PcbC protein subunits produce a light-harvesting
antenna for photosystem I that may compensate for the reduction in
the level of this photosystem relative to photosystem II that
occurs during iron stress.

Tropical Pacific waters generally appear to be low in iron (e.g.,
Coale et al. 1996; Behrenfeld et al. 2006). Although we do not know
the iron concentration of the local environment from which strain
MBIC11017 was isolated, there are reasons to believe that
Acaryochloris cells may be iron limited in their natural habitat.
This strain was isolated from underneath the ascidian Lissoclinum
patella (Miyashita et al. 1996), which belongs to a suborder
(Aplousobranchia) that includes members notable for the
accumulation of high concentrations of iron from the environment in
blood cells called ferrocytes (Endean 1955). In addition, the
positive response of MBIC11017 laboratory cultures to heavy iron
addition suggests an organism with high demand for this nutrient
(Swingley et al. 2005).

Other recent duplicates in the MBIC11017 genome that are involved
in light harvesting encode pigment and scaffold components of
phycobiliproteins (table 2), the major accessory pigments in
photosynthesis for most cyanobacteria. Multiple duplicate copies of
genes for the phycobiliprotein phycocyanin, which specifically
harvests yellow-orange light for photosynthesis, as well as linker
proteins essential for the assembly of phycobiliprotein rods, are
located on plasmid pREB3 (Swingley et al. 2008). Strain MBIC11017
produces phycobiliproteins under low light conditions in the
laboratory (Chan et al. 2007). By contrast, strain CCMEE 5410 does
not produce phycobiliproteins (Chan et al. 2007), and these genes
are missing entirely from its genome (table 2). This pattern
suggests differences in the availability of yellow-orange light in
the two environments. These wavelengths appear to be available at
low levels in the natural environment of strain MBIC11017 (Kuhl et
al. 2005), whereas they may be more rapidly attenuated in the
turbid Salton Sea environment from which strain CCMEE 5410 was
isolated (Miller et al. 2005) by phycobiliprotein-producing
plankton (Wood et al. 2002) and inorganic particulate matter (Swan
et al. 2007) in the overlying water column.

Table 3

Select Strain-Specific Duplicates in the Strain CCMEE 5410 Genome

ORFs a                Annotation                          dS          MBIC11017

Carbon metabolism
0720/(2355/2497)      Fructose-bisphosphate aldolase      2.59/1.48   3372
2488/(1772/2358)      Xul5P/Fru6P phosphoketolase         2.69/2.27   0443
1774/2356             Acetate kinase                      2.57        0445
2357/2490             Phosphoglycerate mutase family      1.58        —
4274/5615             Phosphoglycerate mutase family      1.03
2365/2496             Putative glycogen phosphorylase     0.84

Copper resistance
2343/2458             Cu resistance protein CopA          0.31
2364/2487             Copper-translocating ATPase         1.10
2372/2481             Copper-translocating ATPase         1.21

Defense mechanisms
3189/7004             RND family multidrug efflux         2.22        2480
1784/6258             RND family multidrug efflux         0.48        0454

Redox homeostasis
2586/(2383/2469)      Glutaredoxin                        1.69/0.84   3463

a ORFs assigned to plasmids are italicized.

The Salton Sea is a phosphorus-limited system characterized by high
concentrations of dissolved organic carbon and nitrogen (Schroeder
et al. 2002), as well as of iron, primarily as reduced particulates
(Holdren and Montaño 2002; de Koff et al. 2008). Heavy metal
(including cadmium, copper, selenium, and zinc) concentrations are
also high (Vogl and Henry 2002; LeBlanc and Schroeder 2008). The
unique duplicate pool of the strain CCMEE 5410 genome (table 3) is
enriched in loci involved in organic carbon metabolism and heavy
metal resistance (in particular, copper). Many duplicate copies
were found on two inferred plasmid contigs encoding ORFs 2340-2395
and ORFs 2456-2512, respectively. Similarly, a number of
single-copy genes in the strain CCMEE 5410 genome in large (>50
kbp), plasmid-assigned regions of no apparent homology with strain
MBIC11017 are also involved in heavy metal (primarily copper)
resistance (ORFs 2382 [copper-translocating ATPase], 2458 [CopA
family copper-resistance protein], 7833 [copper-resistance protein
precursor CopB], 0016 [CzcA family heavy metal efflux pump]).

A Role for recA Dosage in Gene Duplication? 

A mechanistic understanding of the gene duplication dynamics of
Acaryochloris genomes must ultimately account for both their high
load of recent duplicates compared with other bacteria and the
observed differences in the duplicate age distributions of the two
strains. The recombination process is a likely candidate for
involvement in duplication because homologous recombination
functions are generally important for both duplicate formation
(Hill et al. 1977; Dimpfl and Echols 1989; Petit et al. 1991) and
loss (Anderson and Roth 1979), particularly for long recombining
sequences such as IS elements (but see, e.g., Reams et al. [2010],
in which duplication depends only weakly on homologous
recombination). The large number of IS elements in Acaryochloris
genomes (table 1) provides potential substrates for recombination.
Although there appears to be a general trend against the retention
of duplicates involved in DNA replication, recombination, and
repair (fig. 6), both genomes contain a number of duplicated genes
from this functional class, and we briefly consider here whether
these duplicates may play a role in the enhanced duplication
dynamics of these genomes.

Most notably, there are an unusually large number of recA copies in
both Acaryochloris genomes. RecA is a multifunctional protein that
is central to homologous recombination, is involved in
recombination-mediated DNA damage repair and rescue of stalled
replication forks, is required for mutagenesis mediated by
translesion synthesis, and regulates gene expression through its
coprotease activity (reviewed by Miller and Kokjohn 1990). The
strain MBIC11017 genome contains seven recA copies (Swingley et al.
2008), whereas there are four complete copies in the genome of
strain CCMEE 5410. The CCMEE 5410 genome also includes a truncated
copy (ORF 6290) with a nonsense mutation at codon 241 produced by
an apparent transposition event that results in the loss of part of
the ATPase core and the C-terminal domain; the putative 3' end of
the gene copy is found on a different contig (ORF 8203) and is also
adjacent to a transposase. In contrast, recA exists as a single
copy in the vast majority of bacterial genomes; the only known
exceptions are the Acaryochloris genomes and those of Myxococcus
xanthus (two copies; Norioka et al. 1995), Bacillus megaterium (two
copies; Nahrstedt et al. 2005), and Deinococcus deserti (three
copies; de Groot et al. 2009).
Fig. 7. — Unrooted Bayesian phylogeny of Acaryochloris recA duplicates. Values at a node represent the Bayesian clade credibility followed by the
bootstrap value for a ML analysis. MBIC11017 copies are green and CCMEE 5410 copies are blue.



Escherichia coli exhibits a 10-fold or greater increase in tandem
duplication rate if RecA is constitutively activated (Dimpfl and
Echols 1989), and overexpression of its eukaryotic homolog RAD51
may also enhance duplication rate as well as generally increase
genome instability (reviewed by Klein 2008). Whether the greater
recA copy number in Acaryochloris genomes results in enhanced
expression remains to be determined, but the association between
copy number and strain duplication rate is consistent with a dosage
effect. Also consistent with this possibility, the D. deserti
genome likewise appears to contain a greater number of paralogs
(~100-200) than those of its single-copy congeners, D. radiodurans
and D. geothermalis (de Groot et al. 2009).

Acaryochloris recAs are both extremely diverse and monophyletic,
indicating that this diversity likely originated solely during
Acaryochloris diversification rather than by horizontal gene
transfer (HGT) (fig. 7). Three chromosomal copies are shared by the
strains and appear to predate divergence from their common
ancestor, whereas the strains vary in the number of plasmid-borne
copies. Although on average all copies have experienced strong
purifying selection (dN/dS = 0.05), there is some evidence that
certain amino acid substitutions have been selectively favored
during recA diversification. Along two branches (labeled A and B in
fig. 7), branch-site models of codon evolution (Yang and Nielsen
2002), which allow for positive selection on one or a few codon
sites on specific branches of a phylogeny, had significantly
greater likelihood values than nearly-neutral models constrained to
dN/dS < 1 for all codons (2ΔL = 70.42, P = 0 for the Branch A
model; 2ΔL = 9.16, P = 0.01 for the Branch B model). The codons
estimated to have experienced positive selection (i.e., dN/dS > 1
with a posterior probability P > 0.95 by Bayesian analysis) at some
point during recA diversification (supplementary fig. S2;
Supplementary Material online) include sites that participate in
monomer-monomer interactions in the RecA filament (codons 105, 114,
115, 127, 153, and 240), make contact with ssDNA-binding sites
(codon 153), or change the properties (e.g., charge) of the
C-terminal domain of the protein (codons 312, 323, and 328), which
is known to autoregulate RecA activity and to bind dsDNA during
homologous recombination (Cox 2007). Whether these changes have
consequences for RecA structure and function remains to be
investigated, as does the possibility that diversification has
yielded paralogous RecAs with nonredundant functions (i.e.,
subfunctionalization) in Acaryochloris cells.
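
The P values reported for the two branch-site tests can be approximately recovered from the likelihood ratio statistics (twice the difference in log likelihood) under one assumption: that each statistic is compared against a chi-square distribution with 2 degrees of freedom. The degrees of freedom are not stated in the text; they are assumed here because they reproduce the reported P = 0.01 for Branch B. With 2 degrees of freedom the chi-square survival function has the closed form exp(-x/2), so no statistics library is needed. The function name `lrt_p_value_df2` is hypothetical.

```python
import math


def lrt_p_value_df2(statistic: float) -> float:
    """P value of a likelihood ratio statistic under chi-square with 2 df.

    The choice of 2 degrees of freedom is an assumption of this sketch,
    not a value given in the text.
    """
    if statistic < 0:
        raise ValueError("likelihood ratio statistic must be non-negative")
    # Survival function of a chi-square distribution with 2 df: exp(-x / 2).
    return math.exp(-statistic / 2.0)


# Likelihood ratio statistics reported for the two branches of fig. 7.
for branch, statistic in (("A", 70.42), ("B", 9.16)):
    p_value = lrt_p_value_df2(statistic)
    print(f"Branch {branch}: statistic = {statistic}, P = {p_value:.2g}")
```

Under this assumption, the Branch B statistic gives P of about 0.01, and the Branch A statistic gives a value below 1e-15, which is consistent with the reported P = 0 after rounding.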

Concluding Remarks 

Strain-specific duplicates concentrated on plasmids make a
substantial contribution to gene content differences between
Acaryochloris genomes and appear to be selectively retained in
their respective contemporary environments by favorable dosage
effects. These differences are in part the product of the
differential retention of duplicates of chromosomal origin (fig.
4B; see below). The lower degree of conservation of gene content on
plasmids compared with the chromosome also suggests an important
role for HGT in Acaryochloris evolution. If this is the case, the
implication is that the ultimate source of many duplicate pairs is
a single-copy gene of foreign origin. In Proteobacteria and
Firmicutes, horizontally transferred genes do appear to be more
likely to be duplicated (Hooper and Berg 2003a).

To obtain a conservative estimate of the contribution of HGT to the
pool of strain-specific duplicates with at least one copy on a
plasmid, we performed BlastP analyses against the NCBI Blast
nonredundant protein sequence database for each genome. For a given
strain-specific duplicate family without an ortholog in the other
strain, it can be difficult to unequivocally determine whether it
is the product of the differential retention of an ancestral gene
or of HGT. This is because many of these loci either exhibit
greatest sequence similarity to a different cyanobacterium or have
no similarity to another sequence in the database (i.e., are orphan
genes). Therefore, taking a conservative approach and using an E
cutoff value of 10^-20, we considered a duplicate family to be of
vertical origin if the top non-Acaryochloris hit for a duplicate
family was a cyanobacterium, to have originated by HGT if the top
hit was another taxon, and to be of unknown origin if it was an
orphan.
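
The three-way decision rule described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the procedure actually used: the `Hit` record, its field names, and the function `classify_origin` are hypothetical, the top hit is taken to be the one with the lowest E value, and a hit is treated as significant when its E value is at or below the cutoff (the text does not say whether the boundary is inclusive).

```python
from dataclasses import dataclass
from typing import Sequence

E_CUTOFF = 1e-20  # E cutoff value given in the text


@dataclass(frozen=True)
class Hit:
    """One BlastP hit for a duplicate family (hypothetical record layout)."""
    genus: str
    is_cyanobacterium: bool
    evalue: float


def classify_origin(hits: Sequence[Hit]) -> str:
    """Classify a duplicate family as 'vertical', 'HGT', or 'unknown'."""
    # Ignore Acaryochloris hits and hits weaker than the cutoff
    # (inclusive boundary is an assumption of this sketch).
    candidates = [
        h for h in hits
        if h.genus != "Acaryochloris" and h.evalue <= E_CUTOFF
    ]
    if not candidates:
        return "unknown"  # orphan: no significant non-Acaryochloris hit
    # Assumption: the top hit is the candidate with the lowest E value.
    top = min(candidates, key=lambda h: h.evalue)
    return "vertical" if top.is_cyanobacterium else "HGT"


# Made-up hits for illustration; these are not data from the study.
example = [
    Hit("Acaryochloris", True, 1e-120),
    Hit("Thermus", False, 1e-60),
    Hit("Synechococcus", True, 1e-25),
]
print(classify_origin(example))
```

In this made-up example the best non-Acaryochloris hit is not a cyanobacterium, so the family is classified as HGT; a family with no hit passing the cutoff is reported as unknown.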

For duplicate pairs for which one copy is on the chromosome (fig.
4B), most are inferred to be of cyanobacterial origin in both
Acaryochloris genomes by the above criteria (62% for strain
MBIC11017 and 77% for the subset of duplicates in strain CCMEE 5410
which could be fully assigned to genetic elements). This is the
expectation if the plasmid copy was derived by duplication of a
chromosomal template. Fewer duplicates in this category appear to
involve horizontally transferred loci (4% and 3%, respectively).
For interplasmid duplicates, however, a larger fraction shows
highest similarity to a taxon other than a cyanobacterium and
likely owes its origins to HGT (8% and 14%, respectively). For
example, CCMEE 5410-specific duplicate pairs ORF2364/ORF2487 and
ORF2365/ORF2496 (table 3) exhibit greatest sequence identities to
Thermus thermophilus (67%) and a planctomycete bacterium (60%),
respectively, and the former has no known homolog among other
cyanobacteria. We believe that these HGT estimates are probably
very low, as approximately half of the duplicates in the
interplasmid category were orphans (57% and 50%, respectively),
which may be the products of HGT. We conclude that the atypical,
largely plasmid-mediated duplication dynamics of Acaryochloris
genomes generate copy number variation among loci of both ancestral
and foreign origin, that this variation is frequently nonadaptive,
but that it also is an important source of locally adaptive genomic
variation with the potential to rapidly respond to environmental
change.

In addition to modifying gene dosage, duplication also creates
opportunities for the evolution of novel gene functions. Whether
neofunctionalization contributed to the innovation of the unique
chlorophyll metabolism of Acaryochloris remains unresolved, as the
details of Chl d biosynthesis and degradation are yet to be fully
elucidated. Chl d differs from Chl a by the replacement of a vinyl
group with a formyl group at C-3 of the porphyrin ring. The pigment
is produced from Chl a and molecular oxygen precursors (Schliep et
al. 2010), and biochemical evidence suggests that the "Chl d
synthase" that performs this reaction is a P450 oxygenase (Chen
2010). The genomes of both strains each harbor ten genes encoding
P450 enzymes; however, none of the copies appear to be recent
duplicates (not shown). We analyzed the pool of duplicates retained
by both genomes for paralogs with homology to other proteins that
could potentially participate in other aspects of Chl d metabolism
such as porphyrin ring degradation. The only candidates to emerge
were a pair of divergent (dS ≈ 1.9) duplicates with homology to a
family of Rieske-FeS motif-containing oxygenases involved in
chlorophyll synthesis and degradation (ORFs 0307/5640 in CCMEE 5410
and 0159/A0067 in MBIC11017). It is notable that A0067 is found
within one of the few regions of extensive synteny between a
MBIC11017 plasmid and the CCMEE 5410 genome (including A0036-A0053
and A0066-A0075). Whether one of these paralogs has diverged to
specifically degrade Chl d awaits further investigation.

Supplementary Material 

Supplementary table S1 is available at Genome Biology and 
Evolution online (http://www.gbe.oxfordjournals.org/). 